Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins
Abstract
MOTIVATION: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. Proteomic data remain scarce by comparison because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Integrative transcriptomic and proteomic analysis that relies on partial proteomic data is therefore likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into the metabolic mechanisms underlying complex biological systems.
RESULTS: In this study, we present a non-linear, data-driven model to predict the abundance of undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict the abundance of proteins that were not experimentally detected, based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all available variables to assess their relative importance and to characterize the behavior of the predictive model. This model showed a strong plateau effect in regions of high mRNA values and sparse data. We therefore removed the genes in those regions, using thresholds estimated from the partial dependency plots in which this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After the genes in the plateau region were removed, mRNA abundance, the main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance.
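The boosting idea described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it uses depth-one regression trees (stumps) instead of full trees, a synthetic two-feature dataset with a built-in plateau, and hypothetical function names (`fit_stump`, `fit_gbt`, `predict`, `r_squared`) and hyperparameter values (`n_trees`, `learning_rate`, `subsample`) chosen purely for illustration. The reported fit is measured on the training data only.

```python
import random

def fit_stump(X, residuals, rows):
    """Best single split (by squared-error reduction) on the sampled rows."""
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    n = len(rows)
    total = sum(residuals[i] for i in rows)
    for f in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][f])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][f], X[order[k]][f]
            if lo == hi:
                continue  # no threshold fits between equal values
            right_sum = total - left_sum
            # Drop in the sum of squared errors obtained by splitting here
            gain = left_sum**2 / k + right_sum**2 / (n - k) - total**2 / n
            if best is None or gain > best[0]:
                best = (gain, f, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best

def fit_gbt(X, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss and stumps."""
    rng = random.Random(seed)
    n = len(y)
    intercept = sum(y) / n
    pred = [intercept] * n
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual
        residuals = [y[i] - pred[i] for i in range(n)]
        # "Stochastic": each tree sees only a random subsample of the rows
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, residuals, rows)
        if stump is None:
            continue
        gain, f, thr, left, right = stump
        importance[f] += gain  # accumulate squared-error reduction per feature
        stumps.append((f, thr, learning_rate * left, learning_rate * right))
        for i in range(n):
            pred[i] += learning_rate * (left if X[i][f] <= thr else right)
    total = sum(importance) or 1.0
    return intercept, stumps, [100 * v / total for v in importance]

def predict(model, row):
    intercept, stumps, _ = model
    return intercept + sum(l if row[f] <= t else r for f, t, l, r in stumps)

def r_squared(y, y_hat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic data (assumption): feature 0 plays the role of mRNA abundance and
# saturates at 6.0, mimicking a plateau; feature 1 is pure noise.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 1)] for _ in range(200)]
y = [min(row[0], 6.0) + data_rng.gauss(0, 0.3) for row in X]

model = fit_gbt(X, y)
r2 = r_squared(y, [predict(model, row) for row in X])
importance = model[2]  # relative importance in percent, one value per feature
print("training R^2:", round(r2, 3))
print("relative importance (%):", [round(v, 1) for v in importance])
```

Because each stump is fitted to the current residuals, the ensemble can follow the flat region without any explicit non-linear term, and the accumulated squared-error reduction gives a simple per-variable importance ranking.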
We then created a new, tuned GBT model using the five most significant predictors. Our non-linear model is built as a series of regression trees fitted sequentially, which gives it an implicit strength in variable selection. The model reports the relative importance of each variable, using mean squared error as the criterion. The coefficients of determination for our non-linear models ranged from 0.393 to 0.582 across both datasets, an improvement over the linear regression used in the past. We evaluated the validity of the non-linear model using biological information on operons, regulons and pathways; the coefficients of variation of the estimated protein abundance values within operons, regulons or pathways were indeed smaller than those for random groups of proteins.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
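The validation step, comparing coefficients of variation within biological groups against random groups, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: the gene identifiers, abundance values and operon assignments are made up, the helper names (`coefficient_of_variation`, `mean_group_cv`, `random_group_cv`) are hypothetical, and random groups are drawn with the same sizes as the real groups.

```python
import random
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

def mean_group_cv(abundance, groups):
    """Average CV over all groups that contain at least two genes."""
    cvs = [
        coefficient_of_variation([abundance[g] for g in members])
        for members in groups.values()
        if len(members) >= 2
    ]
    return statistics.mean(cvs)

def random_group_cv(abundance, groups, n_rounds=200, seed=0):
    """Average CV of randomly drawn groups with the same sizes as the real ones."""
    rng = random.Random(seed)
    genes = list(abundance)
    sizes = [len(m) for m in groups.values() if len(m) >= 2]
    rounds = []
    for _ in range(n_rounds):
        cvs = [
            coefficient_of_variation([abundance[g] for g in rng.sample(genes, size)])
            for size in sizes
        ]
        rounds.append(statistics.mean(cvs))
    return statistics.mean(rounds)

# Hypothetical predicted abundances: genes in one operon have similar values.
abundance = {
    "geneA1": 10.0, "geneA2": 11.0, "geneA3": 12.0,
    "geneB1": 100.0, "geneB2": 105.0, "geneB3": 95.0,
    "geneC1": 1000.0, "geneC2": 990.0, "geneC3": 1010.0,
}
operons = {
    "operonA": ["geneA1", "geneA2", "geneA3"],
    "operonB": ["geneB1", "geneB2", "geneB3"],
    "operonC": ["geneC1", "geneC2", "geneC3"],
}

operon_cv = mean_group_cv(abundance, operons)
random_cv = random_group_cv(abundance, operons)
print("mean CV within operons:", round(operon_cv, 3))
print("mean CV within random groups:", round(random_cv, 3))
```

A lower within-group CV than the random baseline is the pattern the abstract reports; the same comparison applies unchanged to regulons or pathways by swapping the group dictionary.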
Similar resources
Prediction and Characterization of Missing Proteomic Data in Desulfovibrio vulgaris
Proteomic datasets are often incomplete due to identification range and sensitivity issues. It becomes important to develop methodologies to estimate missing proteomic data, allowing better interpretation of proteomic datasets and metabolic mechanisms underlying complex biological systems. In this study, we applied an artificial neural network to approximate the relationships between cognate tr...
Transcriptomic and Proteomic Analysis of Arion vulgaris—Proteins for Probably Successful Survival Strategies?
The Spanish slug, Arion vulgaris, is considered one of the hundred most invasive species in Central Europe. The immense and very successful adaptation and spreading of A. vulgaris suggest that it developed highly effective mechanisms to deal with infections and natural predators. Current transcriptomic and proteomic studies on gastropods have been restricted mainly to marine and freshwater gast...
I-3: Human Y Chromosome Proteome Project 2012 Update
The Human Genome Project has generated a blueprint for the approximately 20,300 gene-encoded proteins potentially active in any of 230 cell types that make up the human body (human proteome). However, based on the UniProtKB/Swiss-Prot database content, about 6000 of at the protein level; for many others, there is very little information related to protein function, abundance, subcellular locali...
Proteomic Analysis of Gene Expression in Basal Cell Carcinoma
Background: Basal Cell Carcinoma (BCC) is a type of non-melanoma skin cancer. Alteration in gene expression is an important event in cancer cells. Detection of this event is possible by proteomics techniques. Methods: Normal and tumor tissues were taken from BCC patient. Total proteins were purified by standard methods, and proteins were separated by two-dimensional electrophoresis...
Comparative Proteomic Analysis of Two Manilkara Species Leaves Under NaCl Stress
Background: Salinity is a major environmental limiting factor that affects agricultural production. The two Manilkara seedlings (M. roxburghiana and M. zapota), which have high economic importance, could not adapt well to higher soil salinity, and little is known about their proteomic mechanisms. Objectives: The mechanisms responsible ...
Journal title:
- Bioinformatics
Volume 25, Issue 15
Pages: -
Publication date: 2009